Harvesting Entities from the Web Using Unique Identifiers – IBEX

Authors

  • Aliaksandr Talaika
  • Joanna Biega
  • Antoine Amarilli
  • Fabian M. Suchanek
Abstract

In this paper we study the prevalence of unique entity identifiers on the Web. These are, e.g., ISBNs (for books), GTINs (for commercial products), DOIs (for documents), email addresses, and others. We show how these identifiers can be harvested systematically from Web pages, and how they can be associated with human-readable names for the entities at large scale. Starting with a simple extraction of identifiers and names from Web pages, we show how we can use the properties of unique identifiers to filter out noise and clean up the extraction result on the entire corpus. The end result is a database of millions of uniquely identified entities of different types, with an accuracy of 73–96% and a very high coverage compared to existing knowledge bases. We use this database to compute novel statistics on the presence of products, people, and other entities on the Web. This work was published at WebDB 2015 [40].
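As a rough illustration of how a property of a unique identifier can filter extraction noise, the sketch below keeps only 13-digit candidates whose final digit matches the standard GTIN-13 / ISBN-13 check digit. The abstract does not say which properties the authors rely on, so the choice of the check digit, the regular expression, and the function names `gtin13_is_valid` and `extract_gtin13` are all assumptions made for this example, not the paper's method.

```python
import re

# Assumed candidate pattern (not from the paper): 13 digits, optionally
# separated by single hyphens or spaces, not embedded in a longer digit run.
CANDIDATE = re.compile(r"(?<!\d)\d(?:[- ]?\d){12}(?!\d)")


def gtin13_is_valid(digits: str) -> bool:
    """Return True if the 13-digit string has a correct GTIN-13 check digit."""
    if len(digits) != 13 or not digits.isdigit():
        return False
    # The first 12 digits are weighted 1, 3, 1, 3, ... from the left.
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(digits[:12]))
    expected = (10 - total % 10) % 10
    return expected == int(digits[12])


def extract_gtin13(text: str) -> list[str]:
    """Return normalized 13-digit identifiers in text that pass the checksum."""
    found = []
    for match in CANDIDATE.finditer(text):
        digits = re.sub(r"[- ]", "", match.group())
        if gtin13_is_valid(digits):
            found.append(digits)
    return found


if __name__ == "__main__":
    sample = "ISBN 978-0-306-40615-7 and order no. 1234567890123"
    print(extract_gtin13(sample))
```

In this sample the ISBN passes the checksum while the order number does not, so a random 13-digit string is rejected about nine times out of ten. A checksum alone cannot associate an identifier with a human-readable name; that step is outside the scope of this sketch.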

Similar resources

Harvesting Entities from the Web Using Unique Identifiers - IBEX

In this paper we study the prevalence of unique entity identifiers on the Web. These are, e.g., ISBNs (for books), GTINs (for commercial products), DOIs (for documents), email addresses, and others. We show how these identifiers can be harvested systematically from Web pages, and how they can be associated with human-readable names for the entities at large scale. Starting with a simple extract...

Harvesting Entities from the Web Using Unique Identifiers

In this paper we study the prevalence of unique entity identifiers on the Web. These are, e.g., ISBNs (for books), GTINs (for commercial products), DOIs (for documents), email addresses, and others. We show how these identifiers can be harvested systematically from Web pages, and how they can be associated with human-readable names for the entities at large scale. Starting with a simple extracti...

Sequence analysis of peste des petits ruminants virus from ibexes in Xinjiang, China.

Peste des petits ruminants (PPR) is an infectious disease caused by peste des petits ruminants virus (PPRV). While PPR mainly affects domestic goats and sheep, it also affects wild ungulates such as ibex, blue sheep, and gazelle, although there are few reports regarding PPRV infection in wild animals. Between January 2015 and February 2015, it was found for the first time that wild ibexes died ...

Presenting a method for extracting structured domain-dependent information from Farsi Web pages

Extracting structured information about entities from web texts is an important task in web mining, natural language processing, and information extraction. Information extraction is useful in many applications including search engines, question-answering systems, recommender systems, machine translation, etc. An information extraction system aims to identify the entities from the text and extr...

Epizootiologic investigations of selected abortive agents in free-ranging Alpine ibex (Capra ibex ibex) in Switzerland.

In the early 2000s, several colonies of Alpine ibex (Capra ibex ibex) in Switzerland ceased growing or began to decrease. Reproductive problems due to infections with abortive agents might have negatively affected recruitment. We assessed the presence of selected agents of abortion in Alpine ibex by serologic, molecular, and culture techniques and evaluated whether infection with these agents m...

Publication date: 2015